Methods for dynamic document generation

ABSTRACT

Dynamic web page generation is optimized by reducing the processing overhead required to parse the web page HTML code for tokens and insert dynamic content. Using the invention, an HTML file for a dynamic web page need be read and parsed only once throughout the life of the server. A software object parses the HTML, decomposes the page into constituent pieces and saves them to data structures as byte streams, which are cached, along with the software object, rendering multiple disk accesses unnecessary when the page is reconstituted. For subsequent requests, the dynamic page is created from the cached version, which is shareable across users and across requests. The optimization reduces server resource usage for dynamic page generation to near zero. The invention is also applicable to other documents combining static and dynamic content that require composition tools for editing.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 12/719,354 filed on Mar. 8, 2010, which is a continuation of U.S. application Ser. No. 12/154,680 filed on May 23, 2008 and issued as U.S. Pat. No. 7,703,010, which is a continuation of and claims priority benefit under 35 U.S.C. §120 to U.S. patent application Ser. No. 10/203,037, filed Aug. 2, 2002 (published as U.S. Publ. No. 2003-0014443 A1) entitled “HIGH PERFORMANCE FREEZE-DRIED DYNAMIC WEB PAGE GENERATION,” which is the U.S. National Phase under 35 U.S.C. §371 of International No. PCT/US01/03424, filed Feb. 1, 2001 (published as WO 01/57721 A2), entitled “HIGH PERFORMANCE FREEZE-DRIED DYNAMIC WEB PAGE GENERATION,” which claims priority to U.S. Provisional Application No. 60/180,394, filed Feb. 4, 2000, entitled “HIGH PERFORMANCE FREEZE-DRIED DYNAMIC WEB PAGE GENERATION,” all of which are hereby incorporated herein by reference in their entirety.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates to processing of electronic documents. More particularly, the invention relates to a method of optimizing generation of web pages having dynamic content.

2. Description of Related Technology

Today's Internet websites must deliver an ever-increasing amount of dynamic web page content. Dynamic web page generation is the process whereby a server computer creates HTML “on the fly” to send to a client computer (a Web browser). Dynamic web pages differ from static web pages in that the content of a dynamic web page can only be determined the moment a page request is received by the server computer. While a static web page might display a biography of Abraham Lincoln, content which can be created once and not changed anymore, such a web page and methodology would not be suitable for a web page which displayed the current price of oranges at five local supermarkets. The latter case requires that the server computer utilize dynamic information and compose that information into a web page to send to a client computer.

A common practice employed to aid in the creation of dynamic web pages is the use of HTML containing “tokens”, or “tokenized HTML”. A tokenized HTML file contains some never-changing static information, for example a page heading with the word “Welcome” in it, but also contains some dynamic or “live” areas; for example, an area after “Welcome” where the user's name is to be dynamically placed. This will allow each user to see a Web page that is customized for them. When Sally visits this web page she'll be greeted with a page title that says “Welcome Sally”, and when Joe visits this web page it will be titled, “Welcome Joe”. One of the major advantages of using tokens as placeholders for dynamic content is that they are extremely unobtrusive, allowing technical personnel such as programmers to make sure that dynamic content is placed in certain areas of the page without the necessity of embedding complicated source code in the HTML, which may be very confusing and distracting to someone such as a graphic designer, who is tasked with maximizing the page's aesthetic appeal.

To serve up dynamic web pages, a web server typically creates a dynamic page by loading up a static HTML page with a “token” or “placeholder” in the area where the user's name goes. The tokens are of a known form; for example, “@UserName@,” so that they may be searched for quickly and uniquely. The server searches the page looking for the tokens that refer to dynamic content, e.g. “@UserName@.” Once the token has been located, the server replaces its text with the dynamically discovered text, e.g. “Sally.” Replacing a token involves storing all of the text leading up to the token and concatenating it with the dynamic content and all of the text following the token. The server must do this for each request it receives (each dynamic page that each user asks for).
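The per-request search-and-concatenate step described above can be illustrated with a minimal Java sketch. The class and method names (NaiveTokenReplacer, replaceToken) are hypothetical and are not taken from any cited system; the sketch only shows why the whole template must be rescanned on every request.

```java
/** Illustrative baseline only: rescans the whole template on every call. */
final class NaiveTokenReplacer {

    /** Returns template with every occurrence of token replaced by value. */
    static String replaceToken(String template, String token, String value) {
        if (token.isEmpty()) {
            return template; // nothing to search for
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int at;
        while ((at = template.indexOf(token, from)) >= 0) {
            out.append(template, from, at); // text leading up to the token
            out.append(value);              // the dynamic content
            from = at + token.length();     // continue after the token
        }
        out.append(template, from, template.length()); // text after the last token
        return out.toString();
    }
}
```

A page with many tokens would repeat this scan once per token and once per request, which is the overhead discussed in the paragraphs that follow.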

Various methods of creating documents with varying content have been proposed. For example, J. Cooper, M. San Soucie, Method of generating document using tables storing pointers and indexes, U.S. Pat. No. 4,996,662 (Feb. 26, 1991) describe a document processing system having a system architecture that includes a control structure providing supervisory routines for controlling supervisory functions of the system and document manipulation routines for operating upon the documents.

R. Smith, D. Ting, J. Boer, M. Mendelssohn, Document management and production system, U.S. Pat. No. 5,181,162 (Jan. 19, 1993) disclose an object-oriented document management and production system in which documents are represented as collections of logical components that may be combined and physically mapped onto a page-by-page layout.

D. Dodge, S. Follett, A. Grecco, J. Tillman, Method and apparatus for document production using common document database, U.S. Pat. No. 5,655,130 (Aug. 5, 1997) describe a system and method for producing a variety of documents from a common document database. In the described system, source documents are decomposed into encapsulated data elements, in which a data element includes the actual content along with classifying data about the content. The encapsulated data elements are saved to a database, and can be later reassembled to form variation specific documents.

All of the systems described above involve the decomposition of source documents into smaller components, storing the document components in a database and reassembling the document components to form different variations of the source document, or completely new documents. While these systems facilitate the building of variation specific documents such as software documentation, and other engineering documents, they only involve combining and recombining static elements in various ways. The disclosed systems don't provide any way of generating a document “on the fly” that incorporates dynamically discovered information. Furthermore, none of the systems described concern themselves with optimizing the process of incorporating dynamic information into an online document by reducing the required computer resource usage.

Various other methods have been proposed for creating dynamic content in pages for delivery to a client over the Internet on the World-Wide Web (WWW). For example, JAVA SERVER PAGES from Sun Microsystems, Inc. of Menlo Park, Calif. or ACTIVE SERVER PAGES from Microsoft Corporation of Redmond, Wash. create all of the page content by having the page's Java or C++ server code write all of the page content to the client browser (the output stream). The major drawback of these solutions is that the server code and the page design (the HTML) are both contained in the same HTML file, making it extremely difficult for non-programmers (e.g. graphic artists) to use popular page design tools to modify the content on these pages.

The primary task of Internet Web server computers is to deliver content (Web pages) to client computers (Web browsers). These server computers are expected to perform these operations extremely rapidly because they are being besieged by, potentially, thousands and thousands of client requests per second. For this reason web developers attempt to reduce bottlenecks in the server software so that the server is performing up to its maximum capacity. The problem, then, arises when many tokens in many dynamic pages need to be repeatedly replaced with dynamic content. Though the example given above only contained a single token, the reality is that dynamic Web pages are normally far more complex than in this example, and might have 20 or more tokens.

Without any optimization, on each request, the server would have to re-read the base HTML file from disk, search and replace all occurrences of each token, and then write the newly created content stream to the client's return stream. The problem with this approach is that it is extremely time consuming. Even if the file is read from disk only once (i.e. it's cached) the act of replacing all occurrences of all tokens in the file is a very slow and very costly operation. It is so slow that it would likely be the primary bottleneck in the server software. Additionally, buying more hardware, bandwidth, etc. will not solve this problem because no matter how many machines were running concurrently, on each client request each web server would have to re-read and re-replace the page's content.

There exists, therefore, a need in the art for a way to reduce the processing overhead required to parse the HTML code of a web page that requires the incorporation of dynamic content, in order to locate the areas, identified by tokens, where the dynamic content is to be inserted, and to replace the tokens with the dynamic content.

SUMMARY OF THE INVENTION

The invention provides a process in which an HTML file for a web page incorporating dynamic content is read and parsed once and only once throughout the life of the server. The dynamic HTML file is read from the server's local disk. A ContentComposer, a software object, parses the HTML and decomposes the page into its constituent pieces, which are stored in multiple data structures. The data structures and the ContentComposer are cached, allowing extremely rapid access. For subsequent page requests, the dynamic page is created from the in-memory version. This in-memory version can be shared across users and across requests. Reading and decomposing the HTML file and performing token replacement is so highly optimized that server resource usage (memory, CPU, etc.) is near zero.

While the preferred embodiment provides a particular implementation directed to replacement of tokens in HTML files by a web server, it is generally applicable to any situation in which documents need to:

- be editable using reasonable and current tools; and
- be dynamic in that some of its content is static but other pieces of its content are created dynamically and will likely change from one creation to the next.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 provides a top-level block diagram of a process for optimizing generation of a computer readable document incorporating static and dynamic content, according to the invention;

FIG. 2 provides a block diagram of a sub-process for “freeze-drying” raw content from a document template, according to the invention;

FIG. 3 provides a block diagram of a plurality of data structures for storing the freeze-dried content of FIG. 2, according to the invention; and

FIG. 4 provides a block diagram of a sub-process for composing a document from the freeze-dried content of FIG. 3, according to the invention.

DETAILED DESCRIPTION

Overview

A description of the details and function of the present invention is provided below. The source code listed in APPENDIX A, written in JAVA, details the implementation of a preferred embodiment of the invention. The patentee has no objection to the reproduction of the source code or other information for the purpose of obtaining and maintaining a valid patent. However, the patentee otherwise reserves all copyright interests.

The invention is embodied as both a process to be executed on a computer, typically a web server, and a computer program product providing computer readable program code means for executing the various steps of the process. The computer readable program code is embodied on a computer readable medium. The computer readable medium may be either fixed, such as a mass storage device or a memory, or it may be removable, such as a CD or a diskette. The invention is implemented through the use of conventional computer programming techniques well known to those skilled in the art. While the source code provided in the attached appendix is written in JAVA, other programming languages would also be suitable for programming the invention. While the invention is preferably programmed in an object-oriented language such as JAVA or C++, other embodiments, consistent with the spirit and scope of the invention, programmed in procedural languages or scripted languages, are also possible.

Referring now to FIG. 1, the invention provides a process for optimizing generation of a computer readable document incorporating static and dynamic content 10, particularly web pages being served up to a client in response to a request from a user. As previously mentioned, one of the most common ways of generating web pages having dynamic content is to start with a page template. Typically, the page template is a file of HTML code containing placeholders where the dynamic content is to be inserted. The placeholders usually consist of tokens. For example, “@Username@” might typically be used as a placeholder for a user's name. When the template is created, or after it is edited, it is saved to disk, typically on a web server. Thereafter, the HTML file is read from the disk and parsed to locate the “live” or dynamic sections, which have been set off or reserved by the tokens. The invention provides a process in which the HTML file need be read from disk and parsed only once, unlike prior art methods, which require that the file be read and parsed every time a client requests the page.

File Reads

In the current embodiment of the invention, the HTML file is read from the disk 11 by means of a helper software object tasked with various utility file operations, such as reading in files, getting file lists and so on. Reading pages of “static” content is performed by a “getContent( )” method embodied in the helper object. The getContent( ) method of the helper object retrieves the raw HTML file and stores the raw content to the cache as a string. More detailed descriptions of the operation of the helper object and the “getContent( )” method are to be found by referring to the documentation provided in the enclosed Appendix.

Content Composer

When parsing the HTML file for caching and token replacement purposes, the goal is to separate the HTML file into its component static pieces, dynamic pieces, and replaceable token pieces. A common term of art for this process is “freeze-drying” 12. The invention provides a ContentComposer class that is the sole parser and manager of this freeze-dried content. Each HTML file has a separate instance of the ContentComposer object associated with it. In keeping with conventional methods of object-oriented programming, in which an object includes both instructions and the associated data, the ContentComposer object for a particular page includes the implementation logic and the raw content string. When a file is loaded, the helper object checks to see if a ContentComposer object exists for the file. If the file has no associated ContentComposer object, the helper object creates one 20. A global HashMap, held in the cache, provides storage for ContentComposer objects. Thus, following creation of the ContentComposer, the new ContentComposer object is stored to the global HashMap. In this way, the deconstructed file data is effectively cached, so that it may be used on subsequent invocations 21.
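The look-up-or-create behavior described above might be sketched as follows. This is an illustration under stated assumptions rather than the code of the Appendix: the names ComposerCacheSketch, Composer and composerFor are hypothetical, the file reader is passed in as a function so the sketch stays self-contained, and a ConcurrentHashMap stands in for the global HashMap so that concurrent requests can share it safely.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: one cached composer per template file. */
final class ComposerCacheSketch {

    /** Stand-in for the ContentComposer; here it only holds the raw content. */
    static final class Composer {
        final String rawContent;

        Composer(String rawContent) {
            this.rawContent = rawContent;
        }
    }

    // Assumption: a concurrent map replaces the plain HashMap of the text.
    private final Map<String, Composer> composers = new ConcurrentHashMap<>();
    private final Function<String, String> fileReader;

    ComposerCacheSketch(Function<String, String> fileReader) {
        this.fileReader = fileReader;
    }

    /** Reads the file only when no composer exists yet for this path. */
    Composer composerFor(String path) {
        return composers.computeIfAbsent(path, p -> new Composer(fileReader.apply(p)));
    }
}
```

Under these assumptions, a second call to composerFor with the same path returns the cached object without invoking the reader again.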

After being cached, ContentComposer parses the HTML file by “decomposing” the raw code string, separating it into its various components 22. Components are one of three types:

- blocks of immutable content containing no tokens;
- lines of immutable content that surround tokens; and
- token replacement values.

According to a preferred embodiment of the invention, a token comprises a string that starts and ends with the “@” characters and contains no embedded white space, newline characters, colons, semi-colons, or commas. However, the delimiting characters are a mere matter of choice, dictated in this case by the conventional manner of creating tokenized HTML code.
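The token syntax just described can be expressed as a small predicate. The following Java sketch is illustrative only; the names TokenSyntax and isValidToken are hypothetical, and two details are assumptions not stated in the text: a token must contain at least one character between the delimiters, and an embedded “@” is rejected.

```java
/** Hypothetical predicate for the token syntax described in the text. */
final class TokenSyntax {

    /** True when candidate starts and ends with '@' and has no illegal characters. */
    static boolean isValidToken(String candidate) {
        // Assumption: at least one name character between the two delimiters.
        if (candidate == null || candidate.length() < 3) {
            return false;
        }
        int last = candidate.length() - 1;
        if (candidate.charAt(0) != '@' || candidate.charAt(last) != '@') {
            return false;
        }
        for (int i = 1; i < last; i++) {
            char c = candidate.charAt(i);
            // White space (including newlines), colons, semi-colons and commas
            // are the illegal characters listed in the text; rejecting an
            // embedded '@' is an extra assumption of this sketch.
            if (Character.isWhitespace(c) || c == ':' || c == ';' || c == ',' || c == '@') {
                return false;
            }
        }
        return true;
    }
}
```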

In some cases, only the token is replaced; in other cases, the entire line containing the token is replaced. For example, the method allows calling processes to replace the whole line of text that the token was on, which is a frequent operation for callers replacing <li> or <select> items.

As previously described, the helper object provides a raw code string to the ContentComposer for parsing. A setContents( ) method within the ContentComposer provides most of the parsing logic for the invention. The setContents( ) method parses the raw content string to locate delimiting characters. Upon locating a delimiting character, the parsing engine evaluates the string for the presence of the previously indicated illegal characters: white space, newline characters, colons, semi-colons, or commas. The presence of any illegal characters indicates that the delimiting character is not associated with a valid token. “@foo bar@” or “keith@iamaze.com” are examples of such invalid strings. As the various page components are identified, they are stored to one of several data objects that are also associated with the ContentComposer. After the page components are identified, the page is decomposed by saving the separate components to a plurality of data structures 23. These data structures are described in greater detail below. It should be noted that the process of separating the page into components and storing them in the data structures constitutes the process commonly known as “freeze-drying.” While, for the purpose of description, the data and the data structures are described separately from the logic and instructions, they are, in fact, all associated within a single ContentComposer object, which is held in the cache. Thus, as with the raw code string, the data structures containing the page components are effectively cached, eliminating the necessity of any further disk accesses when the HTML file is composed.
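The scanning step described above, locating a delimiting character and then rejecting candidates that contain illegal characters, might look like the following simplified Java sketch. It is not the setContents( ) method of the Appendix: the names TokenScanner and findTokens are hypothetical, and the sketch only returns the valid tokens in order of appearance rather than building the data structures.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified scanner for "@...@" tokens in raw content. */
final class TokenScanner {

    private static boolean isIllegal(char c) {
        return Character.isWhitespace(c) || c == ':' || c == ';' || c == ',';
    }

    /** Returns the valid tokens found in raw, in order of appearance. */
    static List<String> findTokens(String raw) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < raw.length()) {
            int open = raw.indexOf('@', i);
            if (open < 0) {
                break; // no further delimiting character
            }
            // Advance until an illegal character, a closing '@', or the end.
            int j = open + 1;
            while (j < raw.length() && raw.charAt(j) != '@' && !isIllegal(raw.charAt(j))) {
                j++;
            }
            boolean closed = j < raw.length() && raw.charAt(j) == '@';
            if (closed && j > open + 1) {
                tokens.add(raw.substring(open, j + 1)); // valid token, delimiters included
                i = j + 1;
            } else {
                i = j; // not a token; resume scanning where the check stopped
            }
        }
        return tokens;
    }
}
```

In this sketch a stray “@”, as in an e-mail address, is skipped because the scan reaches an illegal character or the end of the text before a closing delimiter.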

After the page components are cached, calling processes can ask the ContentComposer to perform token replacement, which it can do very fast, in O(1) time, because the tokens are stored in a HashMap as described below. The final part of SXContentComposer's lifecycle is when the caller asks the ContentComposer to “compose( )” itself, thus creating a page for download to a client 13. The compose( ) method itself provides additional important performance gains. Rather than recomposing the HTML into a string and passing the string to the calling process, which is extremely wasteful of memory and processor time, the ContentComposer walks through the data structures and writes the data to an output stream as it walks 14.

This implementation holds three primary data structures. It is necessary to hold this parsed data in three disparate, but linked, data structures because the data must be accessed from a number of different “angles”, and for a number of different purposes. The composer will need access to all the original static text, plus some way to gather the token replacement values. The caller will need to replace token values (by specifying the token name), or the whole line the token appears on. The caller may also want to inspect the line a token appears on.

Data Structures

The three primary data structures are as follows:

The first is an array of immutable content broken up into “chunks” 30. Each chunk is either a text block with no “@foo@” tokens, or it is an Integer object pointing to the index of a token replacement object (SXTokenLine), which will supply the values (string) for that chunk.

The second data structure is also an array of immutable content: an array of the token-replacement-objects mentioned above 31, and pointed to by the chunks array. These token-replacement-objects are of type Token Line and they hold the static text that immediately precedes and follows a token. They also hold the raw token name itself (e.g. “@FooBar@”) as well as a pointer to an object stored within the third data structure, a structure that holds the replacement line or replacement value associated with this token. This final object is of type Token. While the names assigned to the various page component types in the current embodiment are descriptive of their content, they are primarily a matter of choice.

The third data structure is a HashMap with all the tokens from the raw content as keys and all the replacement values set by the calling process as the values 32. These replacement values are of type Token Object, which can hold a replacement line or a replacement value for a token.

Note that the immutable text chunks never change throughout the life of this object, while the values stored in the tokens and replacement values HashMap are likely to change every time content is created, since tokens and replacement values represent the dynamic portion of the content.

Furthermore, to reduce the overhead of future writes to streams, and to reduce the excessive creation of string objects, the static data in both the immutable text chunks array and the immutable token lines array is stored as byte[ ] rather than as strings.

Compose( ) Method

The Compose( ) method of the ContentComposer writes each text chunk and token replacement value to an output stream in sequential order, creating a single, coherent, token-replaced text stream.

As the ContentComposer walks the immutable text chunks array 40, if it encounters an array entry that is a token rather than a chunk of text, instead of concatenating the actual token, it concatenates the value for the token found in the tokens and replacement values HashMap 41.
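The walk described above can be illustrated with a compact Java sketch. It is a simplified illustration, not the Compose( ) method of the Appendix: the names ComposeSketch, TokenLine, text and composeToString are hypothetical, replacement values are held as plain strings, only whole-token replacement is shown, UTF-8 encoding is assumed, and a token with no replacement value is written out unchanged.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical sketch of walking the chunk array and writing to a stream. */
final class ComposeSketch {

    /** Static text around one token, pre-encoded as bytes. */
    static final class TokenLine {
        final byte[] prefix;
        final String tokenName;
        final byte[] suffix;

        TokenLine(String prefix, String tokenName, String suffix) {
            this.prefix = text(prefix);
            this.tokenName = tokenName;
            this.suffix = text(suffix);
        }
    }

    /** Encodes static text once so that later writes need no conversion. */
    static byte[] text(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Each chunk is either a byte[] of static text or an Integer index into tokenLines. */
    static void compose(Object[] chunks, TokenLine[] tokenLines,
                        Map<String, String> values, OutputStream out) throws IOException {
        for (Object chunk : chunks) {
            if (chunk instanceof byte[]) {
                out.write((byte[]) chunk);
            } else {
                TokenLine line = tokenLines[(Integer) chunk];
                String value = values.get(line.tokenName);
                out.write(line.prefix);
                // Assumption: an unreplaced token is emitted as-is.
                out.write(text(value != null ? value : line.tokenName));
                out.write(line.suffix);
            }
        }
    }

    /** Convenience wrapper for callers that want a string instead of a stream. */
    static String composeToString(Object[] chunks, TokenLine[] tokenLines,
                                  Map<String, String> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            compose(chunks, tokenLines, values, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Because the static text is already held as bytes, the only per-request encoding work in this sketch is for the replacement values themselves.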

The specific process and data structures used by the ContentComposer are described in greater detail in the example provided below.

Example

Sample Raw Content

<html>
<title> iAmaze Presentation Tool </title>
<h1> Welcome to iAmaze, @UserName@! </h1>
<br>
<br>
Would you like to work on the presentation you last worked on, named @LastPresentation@?
<br>
If so, click here.
</html>

Sample Raw Data Structures Created from Raw Content:

Immutable Text Chunks Array:

immutableTextChunksArray[0] = “<html> <title>iAmaze Presentation Tool</title>”
immutableTextChunksArray[1] = “new Integer(0)” (use to look up, at index=0, this token's pre- & post- SXTokenLine line text objects in the “immutableTokenLines” array.)
immutableTextChunksArray[2] = “!</h1> <br> <br>”
immutableTextChunksArray[3] = “new Integer(1)” (index into “immutableTokenLines” array, above and below)
immutableTextChunksArray[4] = “? <br> If so, click here. </html>”

Immutable Token Lines Array:

ImmutableTokenLinesArray[0] = {SXTokenLine{prefix=“<h1>Welcome to iAmaze, ”, suffix=“! </h1>”, pointer to SXToken object in the tokensAndReplacementValues}}
ImmutableTokenLinesArray[1] = {SXTokenLine{prefix=“Would you like to work on the presentation you last worked on, named: ”, suffix=“”, pointer to SXToken object in the tokensAndReplacementValues}}

Tokens and Replacement Values HashMap:

TokensAndReplacementValues={
{“@UserName@”, SXToken{replacementForToken=null, replacementForTokenLine=null}},
{“@LastPresentation@”, SXToken{replacementForToken=null, replacementForTokenLine=null}}}

Thus, the data structures for the example page appear as shown above immediately after the parsing or “freeze-dry” process. After being supplied values by a calling process, for example, in response to a request from a user, two separate methods are called to replace the tokens with the new content:

After calls to:

anSXContentComposer.replaceLineContainingToken(“@UserName@”, “<h1>Welcome to work, Keith! </h1>”);
anSXContentComposer.replaceToken(“@LastPresentation@”, “1999 Harleys”);

The tokens and replacement values HashMap looks as follows:

tokensAndReplacementValues={
{“@UserName@”, SXToken{replacementForToken=null, replacementForTokenLine=“<h1>Welcome to work, Keith! </h1>”}},
{“@LastPresentation@”, SXToken{replacementForToken=“1999 Harleys”, replacementForTokenLine=null}}}

The first call replaces the entire line containing the token. The second call replaces only the token. The immutable text chunks and the immutable token lines arrays remain the same, since they contain immutable data.
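The difference between the two calls can be illustrated with a short Java sketch. The names ReplacementSketch, register and render are hypothetical; the field names mirror those of the example above. One rule is an assumption of this sketch and is not stated in the text: when both a line replacement and a token replacement are set, the line replacement takes precedence.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of token versus whole-line replacement. */
final class ReplacementSketch {

    /** Mirrors the two replacement fields shown in the example. */
    static final class Token {
        String replacementForToken;
        String replacementForTokenLine;
    }

    private final Map<String, Token> tokens = new HashMap<>();

    /** Called once per token found while parsing the template. */
    void register(String name) {
        tokens.put(name, new Token());
    }

    void replaceToken(String name, String value) {
        lookup(name).replacementForToken = value;
    }

    void replaceLineContainingToken(String name, String line) {
        lookup(name).replacementForTokenLine = line;
    }

    /** Renders one token line from its static prefix and suffix. */
    String render(String name, String prefix, String suffix) {
        Token t = lookup(name);
        if (t.replacementForTokenLine != null) {
            return t.replacementForTokenLine; // assumption: the whole line wins
        }
        String value = t.replacementForToken != null ? t.replacementForToken : name;
        return prefix + value + suffix;
    }

    private Token lookup(String name) {
        Token t = tokens.get(name); // constant-time lookup by token name
        if (t == null) {
            throw new IllegalArgumentException("unknown token: " + name);
        }
        return t;
    }
}
```

Only the Token objects change between requests in this sketch; the prefix and suffix passed to render correspond to the immutable token line text.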

A call to SXContentComposer's Compose( ) or toString( ) methods generates the following:

<html> <title> iAmaze Presentation Tool </title> <h1> Welcome to work, Keith! </h1> <br> <br> Would you like to work on the presentation you last worked on, named 1999 Harleys? <br> If so, click here. </html>

The toString( ) method outputs a string to the output stream in a fashion similar to the Compose( ) method. More detailed description of the toString( ) method as well as the replaceLineContainingToken( ) and replaceToken( ) methods is to be found below.

The following is the source code for an implementation of the present invention, written in JAVA:

Although the invention has been described herein with reference to certain preferred embodiments, one skilled in the art will readily appreciate that other applications may be substituted without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims included below.

1. A process for optimizing generation of a computer readable document incorporating static and dynamic content, comprising the steps of: providing a template file of said document, said file resident on a mass storage device of a first computer; reading said template into memory; creating a content composer, said content composer comprising a first software object; parsing said template by said content composer; decomposing said template into separate page components by said content composer; converting said components into strings of computer readable code by said content composer; storing said strings to one or more data structures; and caching said data structures containing said page components.